
Automated Test: kafka-consumer-parallel-after #319

Conversation

@admin-coderabbit (Owner) commented Feb 4, 2026

This pull request was automatically created by @coderabbitai/e2e-reviewer.

Batch created pull request.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added "thread-queue-parallel" processing mode for result and uptime consumers with guaranteed per-group ordering and improved offset tracking and commit coordination.
  • Refactor

    • Updated configuration options and help text for worker settings to better describe parallelism capacity.
  • Tests

    • Added comprehensive test suites for queue-based processing, covering offset tracking, commit semantics, order preservation, concurrent processing, and Kafka integration validation.

One potential problem we have with batch processing is that any one slow item will clog up the whole batch. This PR implements a queueing method instead, where we keep N queues that each have their own workers. There's still a chance of individual items backlogging a queue, but we can try increased concurrency here to reduce the chance of that happening.
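
As a rough illustration of that approach (not the code in this PR; `MiniQueuePool` is a made-up name, and the real implementation is the FixedQueuePool / OrderedQueueWorker pair described in the review below), a fixed set of queues with one dedicated worker each, where a group key hashes to a queue so a slow item only delays its own queue:

```python
import queue
import threading

# Hypothetical sketch of the N-queue idea. The walkthrough says the real code
# uses consistent hashing; plain modulo hashing is used here only for brevity.
class MiniQueuePool:
    def __init__(self, num_queues: int, process) -> None:
        self.queues = [queue.Queue() for _ in range(num_queues)]
        for q in self.queues:
            threading.Thread(target=self._worker, args=(q, process), daemon=True).start()

    def _worker(self, q: queue.Queue, process) -> None:
        while True:
            item = q.get()   # blocks until work arrives
            process(item)    # a slow item only delays this queue, not the others
            q.task_done()

    def submit(self, group_key: str, item) -> None:
        # The same group key always maps to the same queue, preserving
        # per-group ordering. (hash() of a str is stable within one process.)
        index = hash(group_key) % len(self.queues)
        self.queues[index].put(item)

# Example: items from different subscriptions land on (likely) different queues.
# pool = MiniQueuePool(num_queues=4, process=print)
# pool.submit("subscription-a", {"check": 1})
# pool.submit("subscription-b", {"check": 2})
```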

@coderabbit-eval bot commented Feb 4, 2026

📝 Walkthrough

Walkthrough

The PR introduces a thread-queue-parallel processing mode for Kafka result consumption. It adds a modular in-process framework with per-partition offset tracking, multiple worker queues with per-group ordering guarantees, and integrates this into the results consumer factory. Comprehensive tests validate queue mechanics, offset management, ordering semantics, and end-to-end integration.
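
As a hedged sketch of how the submit path fits together (inferred from this walkthrough and the sequence diagram below; attribute and field names are assumptions where the diagram does not show them):

```python
from dataclasses import dataclass
from typing import Any

from arroyo.types import Partition


@dataclass(frozen=True)
class WorkItem:
    # Field names assumed; the PR's WorkItem carries partition/offset alongside
    # the decoded result so the worker can report completion to the tracker.
    partition: Partition
    offset: int
    result: Any


def handle_message(strategy, partition: Partition, offset: int, payload: bytes) -> None:
    """Sketch of the submit path implied by the sequence diagram; the Arroyo
    message unpacking and the strategy's exact attribute names are elided."""
    result = strategy.decoder(payload)                     # decode the raw Kafka payload
    group_key = strategy.grouping_fn(result)               # e.g. a per-subscription key
    strategy.offset_tracker.add_offset(partition, offset)  # offset is now outstanding
    strategy.queue_pool.submit(group_key, WorkItem(partition, offset, result))
```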

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration Updates**<br/>`src/sentry/consumers/__init__.py` | Added "thread-queue-parallel" as a new uptime mode choice; updated max-workers help text to describe overall parallelism capacity in parallel mode. |
| **Core Queue Processing Framework**<br/>`src/sentry/remote_subscriptions/consumers/queue_consumer.py` | Introduced a complete in-process, in-order, multi-queue processing framework: a WorkItem dataclass for partition/offset tracking, an OffsetTracker for thread-safe per-partition offset state management, an OrderedQueueWorker for dedicated queue processing, a FixedQueuePool for multi-queue distribution via consistent hashing, and a SimpleQueueProcessingStrategy that wires the components together with a periodic commit loop and metrics reporting (see the usage sketch after this table). |
| **Results Consumer Integration**<br/>`src/sentry/remote_subscriptions/consumers/result_consumer.py` | Extended ResultsStrategyFactory to support "thread-queue-parallel" mode with a new thread_queue_parallel flag, a queue_pool attribute, a create_thread_queue_parallel_worker() method, and conditional routing in create_with_partitions(); added queue pool shutdown to the factory's shutdown(). |
| **Queue Consumer Tests**<br/>`tests/sentry/remote_subscriptions/consumers/test_queue_consumer.py` | Comprehensive test suite covering OffsetTracker semantics, FixedQueuePool queue assignment and ordering, SimpleQueueProcessingStrategy end-to-end processing, offset commit behavior, and factory integration with thread-queue-parallel mode. |
| **Uptime Consumer Tests**<br/>`tests/sentry/uptime/consumers/test_results_consumer.py` | Extensive tests for thread-queue-parallel processing, including order preservation, concurrent multi-subscription handling, offset commit verification, error handling, offset gaps, graceful shutdown, and Kafka-backed integration tests with AdminClient and offset verification. |
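
The offset-tracking contract described above (and in the snippet quoted later in this review) can be illustrated with a small usage example; the constructor signature and some behavior details are assumptions, not the actual test code:

```python
from arroyo.types import Partition, Topic

from sentry.remote_subscriptions.consumers.queue_consumer import OffsetTracker

partition = Partition(Topic("uptime-results"), 0)  # topic name is illustrative
tracker = OffsetTracker()                          # constructor signature assumed

# Offsets are registered when messages are submitted...
for offset in (10, 11, 12):
    tracker.add_offset(partition, offset)

# ...and completed by workers, possibly out of order.
tracker.complete_offset(partition, 10)
tracker.complete_offset(partition, 12)

# Only the highest contiguous completed offset is committable: 11 is still
# outstanding, so 12 cannot be committed yet.
assert tracker.get_committable_offsets() == {partition: 10}

tracker.complete_offset(partition, 11)
assert tracker.get_committable_offsets() == {partition: 12}

# After the commit loop actually commits, it records that fact.
tracker.mark_committed(partition, 12)
```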

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant SubmitHandler as submit()
    participant Decoder as decoder
    participant GroupingFn as grouping_fn
    participant QueuePool as FixedQueuePool
    participant WorkQueue as WorkQueue<br/>(per group)
    participant Worker as OrderedQueueWorker
    participant Processor as result_processor
    participant Tracker as OffsetTracker
    participant CommitLoop as _commit_loop
    participant Commit as commit_function

    Client->>SubmitHandler: submit(message)
    SubmitHandler->>Decoder: decode(payload)
    Decoder-->>SubmitHandler: result
    SubmitHandler->>GroupingFn: get_group_key(result)
    GroupingFn-->>SubmitHandler: group_key
    SubmitHandler->>Tracker: add_offset(partition, offset)
    SubmitHandler->>QueuePool: submit(group_key, WorkItem)
    QueuePool->>QueuePool: get_queue_for_group(group_key)
    QueuePool->>WorkQueue: put(WorkItem)

    Worker->>WorkQueue: get(WorkItem)
    Worker->>Processor: process(group_key, result)
    Processor-->>Worker: ✓
    Worker->>Tracker: complete_offset(partition, offset)

    CommitLoop->>CommitLoop: periodic timer
    CommitLoop->>Tracker: get_committable_offsets()
    Tracker-->>CommitLoop: dict[partition→offset]
    CommitLoop->>Commit: offsets_dict
    Commit->>Tracker: mark_committed(partition, offset)
```
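
The periodic commit at the bottom of the diagram could look roughly like the following; the interval, attribute names, and whether the committed value is the offset itself or offset + 1 are not shown in the diagram and are assumptions here:

```python
import time

def _commit_loop(self) -> None:
    """Sketch of the periodic commit flow from the diagram above."""
    while not self.shutdown_event.is_set():
        time.sleep(self.commit_interval)  # interval value is an assumption
        committable = self.offset_tracker.get_committable_offsets()
        if not committable:
            continue
        # Hand the partition -> offset mapping to the commit callback. (Kafka
        # convention commits the next offset to read; whether the real code
        # adds 1 here is not visible in the diagram.)
        self.commit_function(committable)
        for partition, offset in committable.items():
            self.offset_tracker.mark_committed(partition, offset)
```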

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 Queues and workers, threads aligned,
Offsets tracked with ordered mind,
Group keys hash to queues so fair,
Parallel hops through the air,
Committing safe what once took care! 🚀

🚥 Pre-merge checks: ✅ 1 passed | ❌ 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The title does not clearly describe the changeset's main contribution: implementing a thread-queue-parallel processing framework with comprehensive tests. | Use a more descriptive title that reflects the primary change, such as 'Add thread-queue-parallel processing strategy for Kafka consumers' or 'Implement FixedQueuePool and SimpleQueueProcessingStrategy for in-order message processing'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch kafka-consumer-parallel-after

Comment @coderabbit eval help to get the list of available commands and usage tips.

@coderabbit-eval bot closed this Feb 4, 2026

@coderabbit-eval bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/sentry/remote_subscriptions/consumers/queue_consumer.py`:
- Around line 67-98: The get_committable_offsets implementation currently uses
range(start, max_offset+1) which can iterate over many absent offsets; replace
that range-based scan with an iteration over only the present offsets (sort the
partition's all_offsets and filter >= start) and detect contiguity by tracking
the next expected offset. Specifically, in get_committable_offsets, for each
partition use sorted_offsets = sorted(o for o in all_offsets if o >= start), set
expected = start, then loop sorted_offsets: if offset == expected and offset not
in outstanding update highest_committable and expected += 1 else break; keep
using last_committed, outstanding, all_offsets and highest_committable as
before.
🧹 Nitpick comments (3)
tests/sentry/uptime/consumers/test_results_consumer.py (2)

2017-2017: Consider using ClassVar annotation for class-level mutable attribute.

Static analysis suggests annotating pytestmark with ClassVar for clarity.

+from typing import ClassVar
+
 class ProcessResultThreadQueueParallelKafkaTest(UptimeTestCase):
-    pytestmark = [requires_kafka]
+    pytestmark: ClassVar = [requires_kafka]

2142-2146: Log the exception during cleanup for better debugging.

Silently catching exceptions during cleanup can mask issues during test debugging. Consider logging the exception.

🔧 Proposed fix
         finally:
             try:
                 admin_client.delete_topics([test_topic])
-            except Exception:
-                pass
+            except Exception as e:
+                logger.warning("Failed to delete test topic %s: %s", test_topic, e)
src/sentry/remote_subscriptions/consumers/queue_consumer.py (1)

344-345: The timeout parameter in join() is ignored.

The join method ignores the timeout parameter and just calls close(). Looking at other processing strategies in the codebase (e.g., SpanFlusher.join), they typically honor the timeout parameter for graceful shutdown coordination.

Consider passing the timeout through to the underlying operations:

🔧 Proposed fix
     def join(self, timeout: float | None = None) -> None:
-        self.close()
+        self.shutdown_event.set()
+        if timeout is not None:
+            self.commit_thread.join(timeout=timeout / 2)
+            # Use remaining time for queue pool
+            self.queue_pool.wait_until_empty(timeout=timeout / 2)
+        else:
+            self.commit_thread.join(timeout=5.0)
+        self.queue_pool.shutdown()

Comment on lines +67 to +98
```python
    def get_committable_offsets(self) -> dict[Partition, int]:
        """
        Get the highest offset per partition that can be safely committed.

        For each partition, finds the highest contiguous offset that has been processed.
        """
        committable = {}
        for partition in list(self.all_offsets.keys()):
            with self._get_partition_lock(partition):
                all_offsets = self.all_offsets[partition]
                if not all_offsets:
                    continue

                outstanding = self.outstanding[partition]
                last_committed = self.last_committed.get(partition, -1)

                min_offset = min(all_offsets)
                max_offset = max(all_offsets)

                start = max(last_committed + 1, min_offset)

                highest_committable = last_committed
                for offset in range(start, max_offset + 1):
                    if offset in all_offsets and offset not in outstanding:
                        highest_committable = offset
                    else:
                        break

                if highest_committable > last_committed:
                    committable[partition] = highest_committable

        return committable
```


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

Kafka partition offsets contiguous gaps sparsity

💡 Result:

Kafka offsets are per-partition, monotonically increasing log positions, but they are not guaranteed to be “dense” (i.e., you should not assume every integer offset corresponds to an application-visible record).

Why you can see gaps / sparsity in a partition’s offsets:

  • Transactional control records (“commit/abort markers”) consume offsets but are not delivered to applications. Kafka writes control records into the log, and consumers use them to handle transactions; they “take up” offsets even though your app never sees them. [1]
  • Aborted transactional records can be skipped (especially with read_committed). With READ_COMMITTED, clients are given enough information (e.g., aborted-transaction metadata; LSO/last stable offset concepts) to discard ABORTED transactional data, which can make offsets appear to jump. [2]
  • Log compaction removes records but does not renumber offsets. After compaction, “missing” offsets remain valid positions; a fetch starting at a compacted-away offset returns the next higher offset that still exists. This creates long-lived holes in older parts of the log. [3]
  • At-least-once / retries / duplicates don’t create gaps, but they break “offset == message count” assumptions. Extra physical records (including duplicates and transaction markers) mean offsets can advance faster than your perceived logical stream. [4]

Practical takeaway: treat offsets as opaque positions used for ordering and checkpointing, not as a contiguous sequence of “real messages.” The only safe assumption is ordering within a partition and uniqueness of offsets within that partition, not density. [3][4]

Sources:
[1] https://kafka.apache.org/41/implementation/message-format/
[2] https://kafka.apache.org/25/design/protocol/
[3] https://docs.confluent.io/kafka/design/log_compaction.html
[4] https://stackoverflow.com/questions/54636524/kafka-streams-does-not-increment-offset-by-1-when-producing-to-topic


Performance concern with sparse offset iteration.

The get_committable_offsets method iterates through range(start, max_offset + 1) which could be expensive if offsets are very sparse. Kafka offsets are not guaranteed to be contiguous — gaps can occur due to transactional control records, aborted transactions, log compaction, and other factors. While the loop breaks on the first gap (limiting the iteration to the first contiguous block), offsets with significant sparse regions could still impact performance. If this becomes a bottleneck, consider tracking offsets in a sorted data structure to avoid iterating through absent offsets.
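
Concretely, the inner scan could be rewritten along these lines (a sketch of the suggestion above, not a verified patch; the surrounding locking and bookkeeping stay as in the quoted snippet):

```python
                # Iterate only offsets that are actually present instead of every
                # integer in range(start, max_offset + 1); stop at the first gap
                # or outstanding offset, preserving the original early exit.
                highest_committable = last_committed
                expected = start
                for offset in sorted(o for o in all_offsets if o >= start):
                    if offset == expected and offset not in outstanding:
                        highest_committable = offset
                        expected += 1
                    else:
                        break
```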

🤖 Prompt for AI Agents
In `@src/sentry/remote_subscriptions/consumers/queue_consumer.py` around lines 67
- 98, The get_committable_offsets implementation currently uses range(start,
max_offset+1) which can iterate over many absent offsets; replace that
range-based scan with an iteration over only the present offsets (sort the
partition's all_offsets and filter >= start) and detect contiguity by tracking
the next expected offset. Specifically, in get_committable_offsets, for each
partition use sorted_offsets = sorted(o for o in all_offsets if o >= start), set
expected = start, then loop sorted_offsets: if offset == expected and offset not
in outstanding update highest_committable and expected += 1 else break; keep
using last_committed, outstanding, all_offsets and highest_committable as
before.
